ABSTRACT

A stemmer reduces an inflected or derived word to its stem by removing the suffix
or prefix attached to the word. It can be used in an information retrieval system
to improve the overall performance of the retrieval process. Stemming is not
equivalent to morphological analysis: it only finds the stem of a word. It also
decreases the number of terms in an information retrieval system. Various
techniques exist for stemming. In this paper, a new web-based stemmer named
“Mula” is proposed for the Odia language. It uses a hybrid approach (i.e. a
combination of the brute force and suffix removal approaches). The new stemmer is
both computationally fast and domain independent. The results are favourable and
indicate that the proposed stemmer can be used effectively in Odia information
retrieval systems. The stemmer also handles the problems of over-stemming and
under-stemming to some extent.

Keywords: Inflectional words, Information retrieval, Mula, Odia stemmer, Brute force, Derivational suffixes

1. INTRODUCTION

Stemming is a technique for reducing an inflected or derived word to its stem by removing the suffix or
prefix attached to it. A stemmer is essential for an information retrieval (IR) system because it improves
the performance of the system. Stemming is not equivalent to morphological analysis; its primary objective
is to decrease the number of terms in an information retrieval system by reducing as many related words as
possible to a common form, which need not be a valid base form. For example, the English word “computation”
has related forms such as “compute”, “computing” and “computes”, all of which a stemmer can reduce to the
stem “comput”. Many stemmers have been developed for different languages to reduce a word to its root or
stem form. Because many derived words are mapped to a single root or stem, stemming reduces the size of the
index files in an IR system. In this way, using a stemmer in the background can improve the recall of an IR
system (i.e. the proportion of relevant documents that are retrieved in response to a query).

Several types of stemming algorithms exist, and they differ in performance and accuracy. Any of the
following techniques can be chosen when designing a stemmer.

(a) Brute-force algorithms: These use a lookup table that contains derived words with their corresponding
roots. To find the root or stem of a word, the table is queried for a matching inflected word. If a match is
found, the corresponding root is returned; otherwise, the lookup fails.
(b) Suffix-stripping approach: This does not depend on a lookup table of derived or inflected words and
their root words. It simply uses a set of rules that drive the algorithm, and it finds the root or stem of
the input word based on those rules.
(c) Lemmatization algorithms: This technique is also called text normalization. In lemmatization the root
word is called the lemma. The part of speech (POS) of the word is identified first, and then an attempt is
made to find the stem. The stemming rules change according to the POS of the word. Lemmatization ensures
that the root or stem of an inflected word belongs to the language.
(d) Stochastic algorithms: These use a probabilistic method to detect the root form of a word. A
probabilistic model is trained on a table of relations between root words and inflected words.
(e) Affix removal approach: As the name suggests, this approach removes the prefixes or suffixes of a word.
It belongs to the truncating methods of stemming. We found that affixes are attached to nouns in the Odia
language.
(f) Hybrid approach: This technique combines two or more of the methods discussed above. For example, it
may merge a rule-based technique with a probabilistic method.
(g) N-gram modeling: Many stemming methods use the character n-grams of a word to select the correct stem
for that word.
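
To make the n-gram idea concrete, the sketch below compares two words by the character bigrams they share,
using the Dice coefficient. This is only one common way of using n-grams to group related words and is not
taken from this paper; the function names `char_ngrams` and `dice_similarity` are illustrative assumptions.

```python
def char_ngrams(word, n=2):
    """Return the set of distinct character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient: 2 * shared n-grams / (n-grams in a + n-grams in b)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # words shorter than n share nothing measurable
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Related forms share most of their bigrams; unrelated words share few or none.
print(round(dice_similarity("computing", "computes"), 3))
print(dice_similarity("computing", "table"))
```

Words whose similarity exceeds a chosen threshold can then be grouped under one representative form. The
threshold itself has to be tuned for the language and is not specified here.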

Stemming plays a vital role in handling the vocabulary mismatch problem of an IR system, in which the query
words do not match the document words. For example, when a user enters a query word that does not exist in
the vocabulary of the documents, the results may be unreliable. To avoid this problem, we have developed a
new web-based hybrid stemmer that combines brute force with an enhanced suffix-stripping algorithm and can
be adopted in an Odia information retrieval system. The new stemmer is both computationally fast and
domain independent.


2. RELATED WORKS 

Stemming plays an important role in the study of information retrieval, and it is not a new concept.
Stemming algorithms have been studied since 1968, when Julie Beth Lovins designed the first one [1]. After
that, many researchers investigated various approaches and proposed several algorithms to improve
performance. Another English stemmer was written by Martin Porter [2] in 1980. Compared with English and
other European languages, little research has been carried out on Indian languages. A Hindi stemmer [3] was
proposed by Rao, Durgesh et al. based on the suffix-stripping approach. A Bengali morphological analyzer [4]
was developed by Dasgupta et al., also based on the suffix-stripping approach. In stemming, the user inputs
an inflected word to the system, and the system produces the root or stem word according to its rule set.
In this paper we have developed a stemmer based on a hybrid approach.


3. LITERATURE SURVEY ON ODIA STEMMER 
Only a few papers have been published on Odia stemmers. Table 1 summarizes these papers with their
key findings.


Table 1. Literature survey on Odia stemmer

Reference                        Key Findings
Sampa et al. [5]                 Published a paper on a stemmer for the Odia language. They used the
                                 suffix-stripping approach to remove inflectional suffixes. The limitation
                                 of this algorithm is that it achieves only 88% accuracy.
Balabantaray, R.C. et al. [6]    Published the paper “FIRE 2012 Submission: MET Track Odia”. They used an
                                 affix removal algorithm. The system reads input text files from a folder.
                                 It first removes stop words from the input files using the stop word
                                 dictionary and then matches each token against the root word dictionary.
                                 After that, the input is matched against the suffix dictionary, the suffix
                                 is removed, and the result is matched with the root words. If the root
                                 word is found, no further processing is required.
Balabantaray, R.C. et al. [7]    Presented a paper on Odia text summarization.
Sethi, Dhabal Prasad [8]         Published a paper on a lightweight stemmer for Odia derivational suffixes.
                                 He used the suffix-stripping method to find the stems.



4. ODIA DERIVATIONAL MORPHOLOGY 
4.1. Odia morphology 

The formal variants of a morpheme are called allomorphs of that morpheme. A variant may be
phonologically or morphologically conditioned. A morpheme may be a free or a bound form, and a word
consists of one or more morphemes. From the point of view of its internal structure, a word may consist of
(i) a root morpheme only, (ii) a root and one or more non-root morphemes, or (iii) more than one root
morpheme. The non-root morphemes are bound forms and are generally referred to as affixes. Roots enter into
further morphological constructions and form a base, while non-roots do not [9].
4.2. Word formation

Word formation is concerned with those words which comprise more than one meaningful 
component called morphemes. The common morphological processes, which are involved in word formation, 
are inflection, derivation, reduplication, echo formation and contraction. The word formation process is 
shown in Figure 1. 


4.2.1. Inflection 

Inflection is a morphological process by which words are formed with the help of bound forms, 
which are called inflectional affixes. Inflected words belong to the same form-class to which the root 
word belongs. 


4.2.2. Derivation 

Derivation is a morphological process, which is concerned with the structure of the stems. In other 
words, word stems are formed by derivation. Two types of this process are generally distinguished and they 
are compounding and derivation. Compounding is a derivational process in which a stem is formed with two 
roots, the resultant stem belonging to the form class of at least one of the constituent roots. Derivation is a 
process of word formation in which a stem is formed with two roots or a root and an affix and the resultant 
stem does not belong to the form class of any of the constituents. Both inflectional and derivational affixes 
are involved in affixation. Depending on their position of occurrence with respect to the root, the affixes are 
classified into prefixes, suffixes and infixes. Prefixes precede the root, suffixes follow it and infixes occur 
within the root. 


4.2.3. Reduplication 

Laurel J. Brinton, in The Structure of Modern English: A Linguistic Introduction, defines reduplication as
“a process similar to derivation, in which the initial syllable or the entire word is doubled, exactly or
with a slight morphological change.” Reduplication is thus a morphological process in which a part of a
root, or the root itself, is added to the root. This type of word formation is common in the Odia language.


4.2.4. Echo formation

The partial repetition of a phoneme or syllable of the base is called an echo formation. In other words,
when the initial phoneme or syllable of the base is replaced by another phoneme or syllable, and the
resulting form has neither an independent occurrence nor a meaning of its own, it is called an echo
formation.


4.2.5. Contraction

Contraction is a process of word formation in which a syllable is dropped from the root. In Odia, words
are formed using different morphological processes, viz. inflection, compounding, derivation, affixation,
reduplication and contraction. Both prefixes and suffixes occur in Odia. The prefixes are used to form
derived adjectives, verbal nouns, agent nouns, collectives and reciprocals. The suffixes denote gender,
number, case, tense, aspect and mood.
Figure 1. Word formation process

Odia morphology deals with the analysis, identification and description of the structure of morphemes.
Morphology deals with the structure of words, and its basic unit of study is the morpheme. For example,
the word @I@@els@ consists of the morphemes MRG and ASQ. A morpheme does not always form a meaningful word
on its own in Odia. Any morpheme in Odia is a root word, a prefix or a suffix. Morphemes are divided into
five categories, as shown in Figure 2.

A morpheme that can stand on its own is called a free morpheme. It does not need to combine with another
morpheme to form a word. Examples of free and bound morphemes:


AIF AIG HAT | AIA AIG(Q) GAs


The morpheme @l@ is a stand-alone morpheme and the morpheme (Q) is a suffix. Most morphemes in the Odia
language are bound.





Figure 2. Type of morpheme 


4.3. Odia derivational morphology 

Derivational morphology deals with the addition of derivational suffixes to a word stem to form a word of
a different class (a different part of speech). As in English, Odia derivational suffixes are added to a
root word to form a different part of speech. The categories are listed in Table 2.


Table 2. Detail description of Odia derivational morphology

Categories               Examples
Noun to Adjective        Noun word + Derivational suffix = Adjective category
                         RONG = QEAR
Adjective to Noun        Adjective word + derivational suffix = noun words
                         AYSE + ozay!
Adjective to Adjective   Adjective word + Suffix = Adjective word
                         agaaga
Verb to Adjective        Verbal word + Derivational suffix = Adjective word
                         aie+azl= aĝe
Verb to Noun             Verbal word + Derivational suffix = Noun word
                         98+ 2dll= RAAI
                         6GR+EID=CIRID

In derivational stemming, words that are derived, either by adding affixes to the stem or by changes at the
morpheme boundary, are reduced to their stem form. The Odia language has a strong inflectional system,
which can be classified into nominal inflection and verbal inflection. Here we represent the rules using
Panini grammar. For example, in ‘@mrlq¢@’ the stem is RMA and the suffix is JE®. The details of the nominal
and verbal suffixes are given in Table 3 and Table 4 [5]. In Odia there are also some prefixes that attach
only to nouns. There are 20 such prefixes, which come mainly from Sanskrit, and they are shown in Table 5.


Table 3. List of nominal suffixes in Odia 








3a RaR agao e Fi 
. A ase- oncase- 
ee (Singular) lira) Relationship) Relationship) 
gari NIG,69,6,61,0N,01N, 228 NGI, JES, AS AER , ral 
(1* Inflection) 
ç 60,39, Ra = p 
Tola LSe, VILT È FISIG, ASQ AR, ae 
(2™ Inflection) O1gd@,0g, 
SPA ARR COR MPS FIA, 
gea, EaRTIA, ETA, GIAI, aad 
(3" Inflection) AREER 
T GF, AG,ARS,S,§, TAT MIG ISIE, JEG ANT, 
n Le ; FAES ,@ EAA .@ Gases ,@ alG, Gag ,laesGacasocaigeales , agele 
eecnon) @ AMIE ARIES agiea 
oy ARTOA, ARTO, 
79 : QOG OQ Ol, a g v AAIR 
(5" Inflection) TOQAT Q ART OB , 
gat Q5,0ID NI SAS, = = 
; career ARTTA ARTA ARRA , age 
(6" Inflection) SORO ,C2, 
agen 6A,602,0,01IN,M,016Q, AREER ARR Old ARROSA, er 
(7" Inflection) 6&0 , GAG COR ARR O 








Table 4. Odia verbal suffix 




















G@l@ (Tense) QQA (Person) NG Gee (Singular Suffix) eg Gee (Plural Suffix) 
gaa gqe ag, aR, ag, QY, AA, QAQ, Q 
cA AR Gola QQA AUG, AY, AAG, Q aeg, AB, QAS 
(Present Tense) gora aga aa, a, 228, Qara, ae cals can 4 
QAICS, GAAS, S, AAI! 
gai QQA AN,AE, DAM, AM, MAG, QF AR, MIR. AIM, AAG 
AG ala 3010 QQA AP, DY, AAR, AAR, QAG AM, DB, DAP, AAM, AAM, QAS 
(Past Tense) goa gga RAL Sema ee ASD AEA, ABS, AACA, QAER, GAMES 
AAAM, QUEM, QAZISI, QAZIES 
gaa gga AG, AAS, A, QAG AG, AAG, AAG, AAG, QAI 
AIAG AIF 3010 QQA A, RAG, AAG, AAG AS, DAG, ALIG, AJC 
(Future Tense) gora qae AM, ACS, AJEG, AAG, ALAS, ALIES, AQA, Ase, AAGA, ZIGG, AAGE 
Qe2Ee 
Table 5. List of Odia prefixes 
q aal aa aq g 
aa q EQ ag aa 
ge ae ag GQ 8 
ae ae ad ad al 





5. PROPOSED METHODOLOGY FOR ODIA STEMMER 

We propose a new web-based stemmer for the Odia language based on a hybrid approach (i.e. a combination
of the brute force and suffix removal approaches) [10]. The proposed stemmer is both computationally
inexpensive and domain independent. The algorithm of the proposed stemmer is shown in Figure 3.

[Flowchart boxes, in order: Start; Enter the word or paragraphs (Input); Search the word by pattern
matching algorithm; Match found? (YES); Remove suffix from the word; Display stem word (Output); End]

Figure 3. Flowchart of the algorithm


Brute force search is also called exhaustive search: it examines all the candidate entries in the data.
Here it searches the root words stored in the database. The technique uses a lookup table that maps
inflected words to root words. Following this technique [11], we create and store as many inflected words
as possible, along with their corresponding root words, in a database table. When an input is given to the
system, a brute force search is carried out to check whether the derived word exists in the database. If
the word is present in the table, the system returns its corresponding stem or root word. If the word is
not present in the table, the system falls back to the suffix removal method. Suffix removal is a
rule-based approach in which a set of rules is defined. By applying these rules, suffixes are removed from
the inflected or derived word to find the stem or root. Figure 4 shows the stemmer user interface. The
enhanced suffix-stripping algorithm is as follows.


Start

Step 1: Enter the derivational word that is to be stemmed.

Step 2: The system recursively removes three-character, two-character and one-character suffixes, in that
order, from the derivational word. A suffix is removed only if the word is longer than the suffix (for
example, a three-character suffix is removed only if the word length is greater than three).

End
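
The suffix-stripping step above can be sketched as follows. This is a minimal illustration under stated
assumptions, not the authors' implementation: the suffix inventory `SUFFIXES` uses hypothetical English
placeholders instead of the Odia suffix lists, the function name `strip_suffixes` is invented here, and
“recursively” is read as “repeat until no listed suffix matches”.

```python
# Hypothetical suffix inventory keyed by suffix length (placeholders, not the Odia lists).
SUFFIXES = {
    3: {"ing", "ion"},
    2: {"es", "ed"},
    1: {"s"},
}


def strip_suffixes(word, suffixes=SUFFIXES):
    """Repeatedly remove a listed suffix, trying 3, then 2, then 1 characters."""
    changed = True
    while changed:
        changed = False
        for length in (3, 2, 1):
            # Strip only when the word is strictly longer than the suffix.
            if len(word) > length and word[-length:] in suffixes.get(length, ()):
                word = word[:-length]
                changed = True
                break  # start again from the longest suffixes
    return word


print(strip_suffixes("computing"))     # comput
print(strip_suffixes("computations"))  # computat ("s" is removed first, then "ion")
print(strip_suffixes("ing"))           # ing (the word is not longer than the suffix)
```

Python's `len` counts Unicode code points, so for Odia text the notion of a “character” (for example, a
consonant plus a vowel sign) would need to be defined before the length checks are applied.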


The inflected word is processed by the stemmer in three steps, as shown below.

a. Input: The inflected Odia word or paragraph is entered as input to the web-based system. Here
“AARG” is given as the input word.

b. Processing: The derived or inflected word is searched for by the brute force method, which matches the
word entered by the user against the words in the database table. If a matching word exists in the
database, its stem is returned as output. If no match is found, the system falls back to the
suffix-stripping method: the algorithm recursively removes suffixes of three characters first, then two
characters and finally one character, on the condition that the inflected word is longer than the
suffix, to find the stem or root of the word.

c. Output: The result is produced after the word has been processed. Here the result is “LYR”.
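
The three steps can be sketched as a two-stage function: a table lookup followed by a rule-based fallback.
The sketch is illustrative only. The table `LOOKUP` stands in for the database table and holds hypothetical
English entries, and `naive_strip` is a deliberately trivial placeholder for the suffix-stripping stage;
neither is taken from the system described here.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table: inflected or derived form -> stem (placeholder entries).
LOOKUP: Dict[str, str] = {
    "computing": "compute",
    "computes": "compute",
    "computation": "compute",
}


def naive_strip(word: str) -> str:
    """Placeholder fallback: drop a final 's' when the word is longer than the suffix."""
    return word[:-1] if len(word) > 1 and word.endswith("s") else word


def hybrid_stem(word: str,
                table: Dict[str, str] = LOOKUP,
                fallback: Callable[[str], str] = naive_strip) -> str:
    # Stage 1: brute-force lookup of the exact surface form.
    stem = table.get(word)
    if stem is not None:
        return stem
    # Stage 2: the word is not in the table, so fall back to suffix removal.
    return fallback(word)


def stem_text(text: str) -> List[str]:
    """Stem every whitespace-separated token of a word or paragraph."""
    return [hybrid_stem(token) for token in text.split()]


print(stem_text("computing tables"))  # ['compute', 'table']
```

Passing the fallback as a parameter keeps the lookup stage independent of the rule set, so a fuller
suffix-stripping routine can be substituted without changing `hybrid_stem`.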


One Character derivational Suffix: 2, 2l, Q, & ©, , A, G, 9, f, Cl, Q, Al, @l, Ul, @l, QI 

Two Character derivational Suffix: UO, 10, UG, CA, AG, CA, QT, MA, MA, AP, G, AM, QG, AIR, QQ, 
AlQ, AP, QA, AI, AA, ASI, QAI, AS, NM, A8, IA, A81, QAI, idl, CS, ZA, FA, QA, Ad, AK, MGI, AS, 
GGA, NA, GA, AE, TA, AA, AIA, AE, SIS, NA, VMI, NE, SIA, AP, FE, AG, OG, OO. 


Three Character derivational Suffix: 2910, AAR, QAR, SIGE, NAA, AMA, AMA. 
Figure 4. Stemmer user interface


6. EVALUATION

We evaluated the stemmer on word sets of increasing size (100 words, 200 words, 300 words and so on) and
measured the time taken to extract the root words, as shown in Figure 5. We have not compared our Odia
stemmer with any existing stemmer for the Odia language, because we found no published results against
which to compare it. Table 6 shows the time taken to extract Odia root words.


Table 6. Time taken to extract Odia root words

Set No     No of Words    Time Taken (Sec.)
Set-1      100            5.07
Set-2      200            8.92
Set-3      300            13.27
Set-4      400            17.18
Set-5      500            20.29
Set-6      600            24.84
Set-7      700            29.11
Set-8      800            32.85
Set-9      900            37.66
Set-10     1000           40.48

[Evaluation chart; x-axis: Number of Words; y-axis scale: 0 to 50]


Figure 5. Evaluation graph
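
The timings in Table 6 can be turned into a per-word cost with simple arithmetic. The sketch below uses
only the table values; the names `TIMINGS` and `ms_per_word` are illustrative, and the calculation is an
added reading of the data rather than an analysis reported by the authors.

```python
# Timings copied from Table 6: (number of words, seconds).
TIMINGS = [
    (100, 5.07), (200, 8.92), (300, 13.27), (400, 17.18), (500, 20.29),
    (600, 24.84), (700, 29.11), (800, 32.85), (900, 37.66), (1000, 40.48),
]


def ms_per_word(words, seconds):
    """Average processing time per word, in milliseconds."""
    return 1000.0 * seconds / words


for words, seconds in TIMINGS:
    print(f"{words:5d} words: {ms_per_word(words, seconds):6.2f} ms/word")

# Marginal cost per extra word between the smallest and the largest set.
(first_words, first_secs), (last_words, last_secs) = TIMINGS[0], TIMINGS[-1]
marginal_ms = 1000.0 * (last_secs - first_secs) / (last_words - first_words)
print(f"marginal cost: {marginal_ms:.2f} ms/word")
```

On these numbers the average cost falls from about 50.7 ms per word for Set-1 to about 40.5 ms per word for
Set-10, which suggests roughly linear growth in the number of words plus a small fixed overhead.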
7. APPLICATION OF STEMMER
7.1. Information retrieval

A stemmer can be used in information retrieval [12] to reduce as many related words as possible to a
common form, which need not be the dictionary base form.


7.2. Indexing 
A stemmer reduces the index size [13] of the documents, which makes the retrieval process faster.
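
The effect on index size can be illustrated with a toy inverted index. The sketch below is an illustration
under stated assumptions, not the system described in this paper: the documents `DOCS`, the table
`TOY_STEMS` and the functions `toy_stem` and `build_index` are hypothetical and use English words for
readability.

```python
from collections import defaultdict


def build_index(docs, stem=lambda word: word):
    """Map each (optionally stemmed) term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


# Hypothetical toy stemmer and documents, for illustration only.
TOY_STEMS = {"computing": "comput", "computes": "comput", "computation": "comput"}


def toy_stem(word):
    return TOY_STEMS.get(word, word)


DOCS = {
    1: "computing is fun",
    2: "she computes daily",
    3: "computation is costly",
}

raw_index = build_index(DOCS)
stemmed_index = build_index(DOCS, toy_stem)

print(len(raw_index), len(stemmed_index))  # 8 terms without stemming, 6 with stemming
print(sorted(stemmed_index["comput"]))     # [1, 2, 3]
```

With stemming, the three related forms collapse into one index entry, so a query for any of them reaches
all three documents while the index holds fewer terms.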


7.3. Auto text summarization 
Automatic text summarization reduces a text document to its summary. A stemmer can be used as a component
of text summarization [14].


7.4. Cross-Language Information Retrieval (CLIR) 

A stemmer can be used in cross-language information retrieval [14] to reduce as many related words as
possible to a common form, which need not be the dictionary base form. For example, when the user enters a
query in English, the system retrieves relevant documents written in Odia.


8. CONCLUSION 

In this paper we designed a stemming algorithm for Odia that removes derivational suffixes from derived
words. The algorithm combines the brute force approach with a new enhanced version of the simple suffix
removal technique. This stemmer can play a vital role in the performance of an Odia IR system, and it can
efficiently handle the problems of under-stemming and over-stemming. In future, we can extend this research
by merging in a few other techniques, by adding more data to the database, and by using extra rules for the
suffix-stripping approach. We can then compare those results with the results of this stemmer and conclude
which combination of techniques is computationally faster, and accordingly pick the approach for building
an Odia IR system.